﻿README for Scalable deep learning to identify brick kilns and aid regulatory capacity data and code


# Data Availability Statement
We used DigitalGlobe (now Maxar Technologies) imagery, which is proprietary. However, we are able to share a "de-geo-identified" version of the imagery used in our machine learning pipeline. This data is stripped of its geospatial metadata, which means certain steps of our pipeline, such as translating the pixel locations of kilns detected within an image to a latitude and longitude will not be reproducible with this data. We also provide the final output of the machine learning pipeline (the GPS coordinates of all model predicted kilns), so the remainder of the analysis is replicable. The Government of Bangladesh data on brick kilns is proprietary and can be requested directly from the Department of Environment. All other geospatial data used to assess the scale and scope of brick manufacturing (population distributions, forests, schools, health facilities, protected areas, PM2.5, wind, precipitation, temperature, and administrative boundaries) are publicly available and we provide details on how to obtain them. All data along with Python and R scripts for the entire machine learning pipeline and spatial analysis are available on the Harvard Dataverse at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HVGW8L.

# Directory Structure
We have split the two parts of the paper into `machine-learning` and `spatial-analysis` subdirectories. All data and scripts for the machine learning pipeline are in the `machine-learning/data` and `machine-learning/scripts` directories, respectively. Likewise, data and scripts for the spatial analysis are in `spatial-analysis/data` and `spatial-analysis/scripts` directories, respectively.


# Overview of Machine Learning Pipeline
1. Train kiln classification model. Evaluate model on images.
2. Search for connected components of adjacent yes-kiln images and acquire signal masses. Get centroids of masses. Convert centroids to geo-coordinates (not possible with de-geo-identified data, but script included for future users who run on their own arbitrary data).
3. Train shape classification model. Evaluate model on images.


# Step 1. Train kiln classification model

## Data
We provide HDF5 files for the final train, and validation dataset: `final_train.hdf5`, and `final_val.hdf5`. 
Each file contains two keys, `image` and `label`. 
* `image`: data has the shape `(x, 256, 256, 3)`, where `x` is the number of datapoints/images, 3 is the number of channels (red, green, blue, respectively), and 256 x 256 is the image size. 
* `label`: key has shape `(x,)`, i.e. the label for each datapoint (0 for `nokiln`, 1 for `yeskiln`).


Data composition: 
* `final_train.hdf5` : Contains 2186 images. 269 of them are positive (contains kilns) and 1917 are negative (does not contain any kiln).
* `final_val.hdf5`: Contains 618 images. 73 of them are positive, and 545 are negative.


Note: As stated in the Data Availability Statement, the images have been de-geo-identified, meaning metadata such as the latitude and longitude of images have been stripped and only the RGB data is provided. 


## Scripts
First, make sure your environment is set up with the necessary dependencies. We use the `virtualenv` library to create a virtual environment.


* Create a new virtual environment: `virtualenv .env`


* Enter the environment: `source .env/bin/activate`


* Install the dependencies: `pip install -r requirements.txt`


* Train a model: If one wants to save the model in a directory whose absolute path is “dir”, then one should use the following command (assuming the code and data file is in directory, and the working directory, while running the code, is also the same):


```python train_and_eval_classifier.py --mode train --model vgg16_cam --train_filename final_train.hdf5 --test_filename final_val.hdf5 --num_epochs 9 --save_dir dir/```


In case the files are in different directories, please change the file paths in the argument given to the script accordingly. Also note that, other than the “number of epochs”, the other default argument values in the training script are the hyper-parameters we used in training our own model, selected after a few iterations of hyper-parameter tuning.


* Evaluate the model on the validation dataset: First, pick any checkpoint number “num” from the directory the model is saved to (“dir”). Then we can use the following command (under the same assumptions as before) to evaluate the model on the validation dataset: 


```python train_and_eval_classifier.py --mode eval --model vgg16_cam --save_dir dir --test_filename final_val.hdf5 --results_csv kiln_results.csv --checkpoint num```


In the above case, the results are saved as “kiln_results.csv”. Users are requested to change the file paths accordingly.


Evaluation result format: In the above case, “kiln_results.csv” would have two columns, “prob” and “prediction”. The i-th value in the “prob” column represents the model assigned probability for the i-th image in the dataset, and similarly the i-th value in the “prediction” column represents the model’s prediction (0 or 1) for the i-th image.


# Step 2. Get connected components of adjacent yes-kiln images. Acquire pixel masses and calculate centroids of masses. Convert centroids to coordinates. 


## Data
Due to the data being de-geo-identified, this step of the pipeline cannot be run immediately. Instead, the user must prepare the input into an expected format described as follows. 


The user must have a csv file with columns: “path” for location of the image, “row” and “col” for the row and column of the image respectively (considering all images to be in a grid), “lat” and “lon” for the latitude and longitude of the top-left corner of the image respectively, and “prediction” from Step 1.


## Scripts
Now we can use the model we’ve trained to recover the class activation mappings. Since we’ve identified the yes-kiln images in Step 1, we simply have to rerun the model on them to collect the signals. 


In order to merge kilns that have been split across multiple images, we create connected component images, which combine yes-kiln images that are adjacent to each other into one single image. Then, we can process the pixel masses to be contiguous regions. For each discrete pixel mass, we calculate its centroid and then translate it into geocoordinates.


```
python find_coords.py --mode eval --save_dir dir/ --checkpoint num --eval_csv test_area.csv
```


The results will be saved in an output csv with a default name of the given eval_csv path with `_coords.csv` as the extension. If the user runs the command with the `--save_cc_img` flag, the connected component images will be saved, and the output csv will also record for each coordinate what image from which it was sourced.


# Step 3. Train shape classification model


## Data
We consider two types of kilns - “Zigzag” kilns, which are rectangular shaped, and “Fixed Chimney” Kilns (also referred to as FCKs), which are ovular shaped. We provide two HDF5 files: `shape_train.hdf5` and `shape_val.hdf5` for training and validating a model, respectively.


Each file contains two keys, `image` and `label`.
* `image`: data has the shape `(x, 256, 256, 3)`, where `x` is the number of datapoints/images, 3 is the number of channels (red, green, blue, respectively), and 256 x 256 is the image size. 
* `label`: key has shape `(x,)`, i.e. the label for each datapoint (0 for `FCK`, 1 for `Zigzag`).


Data composition:
* `shape_train.hdf5` : Contains 2341 images. 1513 of them are positive (contains a “Zigzag” kiln) and 828 are negative (contains a “Fixed Chimney” kiln or FCK)
* `shape_val.hdf5` : Contains 1004 images. 639 of them are positive (contains a “Zigzag” kiln) and 365 of them are negative (contains a “Fixed Chimney” kiln or FCK)


## Scripts
(First, make sure your environment is compatible, and has the proper dependencies installed. Please check the “Scripts” section in Step 1)


* Train a shape model: ```python train_and_eval_classifier.py --mode train --model vgg16_cam --train_filename shape_train.hdf5 --test_filename shape_val.hdf5 --num_epochs 8 --save_dir dir/```


Note that, other than the “number of epochs”, the other default argument values in the training script are the hyper-parameters we used in training our own model, selected after a few iterations of hyper-parameter tuning.


* Test the shape model: ```python train_and_eval_classifier.py --mode eval --model vgg16_cam --save_dir dir --test_filename shape_val.hdf5 --results_csv shape_results.csv --checkpoint num```


The script saves the models’ predictions in `shape_results.csv` in the above case, and it has the same format as mentioned in the “scripts” section in step 1.




# Spatial Analysis: Scale and Scope of Brick Manufacturing
## Data
We use the following sources of data for the spatial analysis of brick kilns & regulations. For each source of data we denote where it is located or how it can be accessed.
* Kiln coordinates identified by our model (`spatial-analysis/data/clean/model_kilns.csv`)
* Government of Bangladesh Data on Kilns: these government data are proprietary and can be requested directly from the Department of Environment.
    * Registered kilns by district
    * Kilns mapped in the CASE report
* Regulated Entities: these data are all free and publicly available and we denote where to access each of them below
    * [Railways](https://data.humdata.org/dataset/bangladesh-railways)
    * [Health Facilities](https://data.humdata.org/dataset/bangladesh-health-facilities-by-lged)
    * [Education Facilities](https://data.humdata.org/dataset/bangladesh-education-facilities-by-lged)
    * [Forests](https://data.humdata.org/dataset/bangladesh-forest-and-natural-parks)
    * [Protected Areas](https://www.protectedplanet.net/country/BGD)
* Administrative Boundaries: these data are all free and publicly available and we denote where to access each of them below
    * [Administrative levels 0-4](https://data.humdata.org/dataset/administrative-boundaries-of-bangladesh-as-of-2015)
    * We utilize levels 2 (district boundaries)
* Gridded Population Data: these data are all free and publicly available from WorldPop
    * [Population](https://www.worldpop.org/geodata/summary?id=5838)
    * [Pregnancies](https://www.worldpop.org/geodata/summary?id=1005 )
* PM2.5: these data are all free and publicly available from OpenAQ
    * We utilized hourly data on PM2.5 from January 1, 2017 to September 9, 2019 from the station at the [US Embassy in Dhaka](https://openaq.org/#/location/8415)
* Meteorological Data: these data are all free and publicly available from the Copernicus Climate Change Service
    * [ERA5 hourly data on single levels from 1979 to present](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview)
    * We utilized hourly data from January 1, 2017 to September 9, 2019 on
            * 10m u-component of wind
            * 10m v-component of wind
            * 2m dewpoint temperature (K)
            * 2m temperature (K)
            * Total precipitation (m)
    * We also provide the cleaned and processed dataframe we used from ERA5, where data were extracted around the US Embassy in Dhaka for the PM2.5 assessment
            * File: `spatial-analysis/data/dhakaEmbassyDF.Rds`


## Scripts
1. `constructKilnPMData.R`: this script processes, extracts, and cleans the meteorological data from ERA5 and the PM2.5 data from OpenAQ. It then uses the GPS locations of the brick kilns from our model, the location of the US Embassy in Dhaka, and daily wind direction to identify how many kilns are upwind of the embassy on a given day. Calls user-written function `extractCopernicus.R`. Exports `dhakaEmbassyDF.Rds` to use in `pnasKilnAnalysis.R`.

2. `constructKilnPopData.R`: Matches population and pregnancy raster data to kiln coordinates to identify counts that are exposed. Calls user-written function `kilnPopDist.R` for this purpose. Exports `popKilnDF.Rds` and `pregKilnDF.Rds` to use in `pnasKilnAnalysis.R`.

3. `pnasKilnAnalysis.R`: Runs entire spatial analysis for manuscript end-to-end. To replicate the full analysis, users will need to obtain access to the datasets described above. We have provided `model_kilns.csv`, `dhakaEmbassyDF.Rds`, `popKilnDF.Rds` and `pregKilnDF.Rds`. Calls several user-written functions (all provided): `checkViolationDist.R`, `closestKiln.R`, `closestObj.R`.

## Steps
1. Obtain access to requisite data.
2. Set-up directory according to the following structure. **Note that `spatial-analysis` should be set up as as an R project for the easiest use**:
- spatial-analysis/
    - scripts/
        - functions/
    - data/
        - raw/
        - clean/
    - output/
3. Store all raw data in `spatial-analysis/data/raw`
4. Run `constructKilnPMData.R`.
5. Run `constructKilnPopData.R`.
6. Run `pnas_kiln_analysis.R`. Note that **Section 2: Comparison to DOE** will not be possible to run without obtaining access to the proprietary government data. Steps 4 and 5 here can be skipped and users can utilize the output of each of those scripts, which we have provided in the `spatial-analysis/data/clean` directory.
